SOC128D - Analytics for a Changing Climate: Introduction to Social Data Science
Final Project Report
To gain a preliminary understanding of the relationship between high temperature and mental health, we conducted a literature review. Firstly, Thompson et al. (2018) conducted a systematic review to illuminate the association between high temperatures and mental health. Through their review, we noted the different categories of mental health that others have explored, namely (A) Suicide (B) Bipolar Disorder (C) Mania, Depression and Heat (D) Organic mental health outcomes, including dementia and (E) Alcohol and substance misuse. Given the limited timeframe of this project, it was not possible to cover all 6 areas and hence we decided to focus on the area of suicide, especially given the relatively larger number of studies that discuss about this aspect as compared to the rest. Notably, the authors underscored the two different views on the relationship between high temperature and mental health, one of which believes there’s a causal relationship between them and the other does not believe that it is the case.
On a macro-level, a study in Australia done between 1970 and 2007 found that males, aged between 30-49 and lived in rural areas, experienced an increase in the relative risk of suicide (Hanigan et al., 2012) (https://doi.org/10.1073/pnas.1112965109). The increase coincided with an increase in drought index between the first and third quartile of every year in the same period, suggesting that there might be a causal relationship between abnormal temperature patterns (which is closely related to drought index) and mental health. Similarly, Deisenhammer et al. (2003) (https://doi.org/10.1046/j.0001-690x.2003.00219) found that high ambient temperatures were closely associated with a range of mental health effects, with the most compelling evidence found for increased suicide risk. On a micro-level, Lohmus (2018) (https://doi.org/10.3390/ijerph15071515) suggested that high temperatures may lead to poor sleep quality and sleep deprivation, both which may contribute to the development of mental health disorders and persistence of existing mental health disorders.
On the other hand, a study analysing experiments, which purported close links between rising temperatures and suicide risk, found that many of these studies did not consider long term climate patterns or lacked sufficient sample size for the results to be generalisable (Dixon et al., 2007) (https://doi.org/10.1007/s00484-006-0081-4). Researchers of the same study also employed simple linear regression to highlight that links between monthly suicide rates and mean suicide rates for certain American counties were weak. In the same vein, Thompson et al. (2018) (https://doi.org/10.1016/j.puhe.2018.06.008) concluded that there was limited evidence for increased heat-related morbidity and deaths amongst those with mental health disorders and acknowledged gaps existed in knowledge about the dynamics between heat and mental health-related deaths.
Recognizing the differing viewpoints on the causal effects of high temperature on mental health, we decided to embark on a data science research to ascertain this effect with the use of data specific to California. Prior to trend analysis, we scoured through the Internet in search for datasets and we were fortunate to be able to obtain a considerable amount of data.
For instance, from the National Centers for Environmental Information (https://www.ncei.noaa.gov/access/monitoring/climate-at-a-glance/county/time-series/), we able to obtain monthly and yearly temperature-related data (e.g., Maximum Temperature, Minimum Temperature, Average Temperature, Precipitation) at a statewide and county-specific level. In addition, we were able to obtain datasets on measures of mental health (e.g. Suicide Rate (Per 100,000 people), Rate of hospitalization for mental health issues) from sources such as Mental Health Oversight & Accountability Commission (https://mhsoac.ca.gov/transparency-suite/suicide-incidence-and-rate/) and Kids Data (Population Reference Bureau) (https://www.kidsdata.org/topic/715/mental-health-hospitalizations-age/trend#fmt=2342&loc=2&tf=5,110&ch=1309&pdist=7).
Upon the compilation and cleaning of our data set, we began our study by analyzing general trends of temperature (independent variable) and mental health data (dependent variable) over time. Temperature was measured through maximum temperature (C), minimum temperature (C), average temperature (C) and precipitation (inch), and mental health was represented using relevant indicators, namely suicide rate (per 100,000) and the rate of hospitalization for mental health issues (per 1,000). Besides analyzing general trends, we explore different means of visualizations to convey the messages presented by our data and to see if there are any (advantageous) differences in the various visualization.
Statewide Yearly Data
We first started by first looking at statewide yearly data of the temperature indicators in California and used line graphs to display changes in maximum, minimum and average temperature over the years.
california_climate_overall = read.csv("proj data/[Overall] California Avg, Mix, Max Temp and Precipitation.csv")
ggplot(data=california_climate_overall, aes(x=YEAR, y=MAX.TEMP..C.)) +
geom_line() +
geom_point(aes(color = MAX.TEMP..C.)) +
scale_color_gradient(low = "blue", high = "red") +
labs(title="Graph of Max Temperature (Celsius) vs Year",
x ="Year", y = "Max Temperature (Celsius)", color = "Max Temperature (C)")+ #label of x-axis, y-axis and title
theme(plot.title = element_text(hjust = 0.5)) + #center title
stat_smooth(method = "lm",
formula = y ~ x,
geom = "smooth") +
ylim(0, 30)
ggplot(data=california_climate_overall, aes(x=YEAR, y=AVG.TEMP..C...2dp.)) +
geom_line() +
geom_point(aes(color = AVG.TEMP..C...2dp.)) +
scale_color_gradient(low = "blue", high = "red") +
labs(title="Graph of Average Temperature (Celsius) vs Year",
x ="Year", y = "Average Temperature (Celsius)", color = "Average Temperature (C)") + #label of x-axis, y-axis and title
theme(plot.title = element_text(hjust = 0.5)) + #center title
stat_smooth(method = "lm",
formula = y ~ x,
geom = "smooth") +
ylim(0, 30)
ggplot(data=california_climate_overall, aes(x=YEAR, y=MIN.TEMP..C.)) +
geom_line() +
geom_point(aes(color = MIN.TEMP..C.)) +
scale_color_gradient(low = "blue", high = "red") +
labs(title="Graph of Min Temperature (Celsius) vs Year",
x ="Year", y = "Min Temperature (Celsius)", color = "Min Temperature (C)") + #label of x-axis, y-axis and title
theme(plot.title = element_text(hjust = 0.5)) + #center title
stat_smooth(method = "lm",
formula = y ~ x,
geom = "smooth") +
ylim(0, 30)
From Figures 1-3, we observed a general increasing trends in maximum, minimum and average temperatures from 1925 to 2021, which corresponded with the lines of best fit obtained for each graph. This result suggests the gradual warming of California over the past few decades.
Statewide Monthly Data
Next, we look at statewide monthly data, and observed the data through various types of visualizations - heatmaps, box and whisker plots and ridgeline graphs.
california_climate_monthly = read.csv("proj data/[Overall] California Monthly Temperature.csv")
california_climate_monthly$MONTH <- factor(california_climate_monthly$MONTH, levels = c("Jan", "Feb", "Mar", "Apr", "May", "Jun",
"Jul", "Aug", "Sep", "Oct", "Nov", "Dec")) # to reorder y-axis/month
Heatmaps allow for the visualization of data through the use of colours to represent the different temperature. Recognising the sheer size of the monthly dataset we obtained, We utilized multiple heatmaps to visualize monthly data. Similar to the previous section (statewide yearly data), we plotted the high, low and average temperatures.
ggplot(california_climate_monthly, aes(x = MONTH, y = YEAR, fill = HIGH_C)) +
geom_tile(color = "black") +
scale_fill_viridis_c(option = "magma", name = "Temperature") + # You can choose any other color scale
labs(title = "California's Monthly High Temperature",
x = "Year",
y = "Month") +
theme(plot.title = element_text(hjust = 0.5)) #center title
ggplot(california_climate_monthly, aes(x = MONTH, y = YEAR, fill = LOW_C)) +
geom_tile(color = "black") +
scale_fill_viridis_c(option = "magma", name = "Temperature") + # You can choose any other color scale
labs(title = "California's Monthly Low Temperature",
x = "Year",
y = "Month") +
theme(plot.title = element_text(hjust = 0.5))
ggplot(california_climate_monthly, aes(x = MONTH, y = YEAR, fill = AVG_C)) +
geom_tile(color = "black") +
scale_fill_viridis_c(option = "inferno", name = "Temperature") + # You can choose any other color scale
labs(title = "California's Monthly Average Temperature",
x = "Year",
y = "Month") +
theme(plot.title = element_text(hjust = 0.5))
While we observe visible changes in temperature across the months from the heatmaps of monthly temperature, it is hard to discern the general changes in temperature across the years.
Box and whisker plots were also used to visualize the spread and skewness of temperature data.
ggplot(california_climate_monthly, aes(x=YEAR, y=HIGH_C, group=YEAR, fill = as.factor(YEAR))) +
geom_boxplot() +
theme(legend.position = "none", plot.title = element_text(hjust = 0.5)) +
labs(title = "Box Plot of High Temperature by Year",
x = "Year",
y = "High Temperature (Celsius)")
ggplot(california_climate_monthly, aes(x=YEAR, y=LOW_C, group=YEAR, fill = as.factor(YEAR))) +
geom_boxplot() +
theme(legend.position = "none", plot.title = element_text(hjust = 0.5)) +
labs(title = "Box Plot of Low Temperature by Year",
x = "Year",
y = "Low Temperature (Celsius)")
ggplot(california_climate_monthly, aes(x=YEAR, y=AVG_C, group=YEAR, fill = as.factor(YEAR))) +
geom_boxplot() +
theme(legend.position = "none", plot.title = element_text(hjust = 0.5)) +
labs(title = "Box Plot of Average Temperature by Year",
x = "Year",
y = "Average Temperature (Celsius)")
While the box and whisker plots provided us an idea of the range of and mean temperature yearly, it is difficult to discern the general trend of high, low & average temperatures over the years given the sheer amount of data presented and compactness of the different plots.
Ridgeline graphs were used to show the distribution of temperatures over time.
count_data <- california_climate_monthly %>%
group_by(YEAR, AVG_C) %>%
summarize(count = n())
## `summarise()` has grouped output by 'YEAR'. You can override using the
## `.groups` argument.
ggplot(count_data, aes(x = AVG_C, y = YEAR, height = count, group = YEAR)) +
geom_ridgeline_gradient(aes(fill = AVG_C), alpha = 0.6) +
scale_fill_gradient(low = "blue", high = "red")+
scale_fill_viridis_c(alpha = 0.6) +
theme(legend.key.size = unit(0.3, "cm"), # Adjust the size of the legend keys (symbols)
legend.text = element_text(size = 6), # Adjust the size of the legend text
plot.title = element_text(hjust = 0.5),
axis.text.y = element_text(size = 6))+
labs(title = "Ridgeline Graph of Average Temperature by Year",
#x = "Year",
#y = "Average Temperature (Celsius)",
x = "Average Temperature (Celsius)",
y = "Year",
fill = "Average Temperature")+
theme(legend.key.size = unit(0.3, "cm"), # Adjust the size of the legend keys (symbols)
legend.text = element_text(size = 6))
## Scale for fill is already present.
## Adding another scale for fill, which will replace the existing scale.
count_data <- california_climate_monthly %>%
group_by(YEAR, HIGH_C) %>%
summarize(count = n())
## `summarise()` has grouped output by 'YEAR'. You can override using the
## `.groups` argument.
ggplot(count_data, aes(x = HIGH_C, y = YEAR, height = count, group = YEAR)) +
geom_ridgeline_gradient(aes(fill = HIGH_C), alpha = 0.6) +
scale_fill_gradient(low = "blue", high = "red")+
scale_fill_viridis_c(alpha = 0.6) +
theme(legend.key.size = unit(0.3, "cm"), # Adjust the size of the legend keys (symbols)
legend.text = element_text(size = 6), # Adjust the size of the legend text
plot.title = element_text(hjust = 0.5),
axis.text.y = element_text(size = 6))+
labs(title = "Ridgeline Graph of High Temperature by Year",
#x = "Year",
#y = "Average Temperature (Celsius)",
x = "Average Temperature (Celsius)",
y = "Year",
fill = "Average Temperature")+
theme(legend.key.size = unit(0.3, "cm"), # Adjust the size of the legend keys (symbols)
legend.text = element_text(size = 6))
## Scale for fill is already present.
## Adding another scale for fill, which will replace the existing scale.
count_data <- california_climate_monthly %>%
group_by(YEAR, LOW_C) %>%
summarize(count = n())
## `summarise()` has grouped output by 'YEAR'. You can override using the
## `.groups` argument.
ggplot(count_data, aes(x = LOW_C, y = YEAR, height = count, group = YEAR)) +
geom_ridgeline_gradient(aes(fill = LOW_C), alpha = 0.6) +
scale_fill_gradient(low = "blue", high = "red")+
scale_fill_viridis_c(alpha = 0.6) +
theme(legend.key.size = unit(0.3, "cm"), # Adjust the size of the legend keys (symbols)
legend.text = element_text(size = 6), # Adjust the size of the legend text
plot.title = element_text(hjust = 0.5),
axis.text.y = element_text(size = 6))+
labs(title = "Ridgeline Graph of Low Temperature by Year",
#x = "Year",
#y = "Average Temperature (Celsius)",
x = "Average Temperature (Celsius)",
y = "Year",
fill = "Average Temperature")+
theme(legend.key.size = unit(0.3, "cm"), # Adjust the size of the legend keys (symbols)
legend.text = element_text(size = 6))
## Scale for fill is already present.
## Adding another scale for fill, which will replace the existing scale.
From Figures 10-12, there seems to be a gradual inclination/movement towards the right side as the years go by, signaling an increase in temperature with time.
County-Level Yearly Data
california_county_climate = read.csv("proj data/County_Level_California_Avg,_Mix,_Max_Temp_and_Precipitation.csv")
Line graphs were used to display changes in maximum, minimum and average temperature over the years, and different colours were used to represent the different counties in California.
ggplot(data = california_county_climate, aes(x = YEAR, y = AVERAGE.TEMP..C., group = COUNTY, color = as.factor(COUNTY))) +
geom_line() +
scale_color_discrete() + # Add more colors as needed
labs(title = "Line Graph of Average Temperature vs Year",
x = "Year",
y = "Average Temperature (Celsius)",
color = "County") +
theme(legend.key.size = unit(0.3, "cm"), # Adjust the size of the legend keys (symbols)
legend.text = element_text(size = 6), # Adjust the size of the legend text
plot.title = element_text(hjust = 0.5))
ggplot(data = california_county_climate, aes(x = YEAR, y = MAX.TEMP..C., group = COUNTY, color = as.factor(COUNTY))) +
geom_line() +
scale_color_discrete() + # Add more colors as needed
labs(title = "Line Graph of Max Temperature vs Year",
x = "Year",
y = "Max Temperature (Celsius)",
color = "County") +
theme(legend.key.size = unit(0.3, "cm"), # Adjust the size of the legend keys (symbols)
legend.text = element_text(size = 6), # Adjust the size of the legend text
plot.title = element_text(hjust = 0.5))
ggplot(data = california_county_climate, aes(x = YEAR, y = MIN.TEMP..C., group = COUNTY, color = as.factor(COUNTY))) +
geom_line() +
scale_color_discrete() + # Add more colors as needed
labs(title = "Line Graph of Min Temperature vs Year",
x = "Year",
y = "Min Temperature (Celsius)",
color = "County") +
theme(legend.key.size = unit(0.3, "cm"), # Adjust the size of the legend keys (symbols)
legend.text = element_text(size = 6), # Adjust the size of the legend text
plot.title = element_text(hjust = 0.5))
With reference to Figure 13, we noted that it is very difficult to discern the trend in average temperatures in the county level given the fluctuations of the line graph as well as the sheer number of counties involved. Similarly, this applies for the trends in maximum and minimum temperatures.
Smooth line (best-fit) graphs were used to better predict patterns of the data, as they remove the fluctuations and noise in data points.
ggplot(california_county_climate, aes(x = YEAR, y = AVERAGE.TEMP..C., color = COUNTY) ) +
geom_smooth(method = "lm", alpha = .15, aes(fill = COUNTY)) +
theme(legend.key.size = unit(0.3, "cm"), # Adjust the size of the legend keys (symbols)
legend.text = element_text(size = 6)) + # Adjust the size of the legend text
labs(title = "Graph of Average Temperature Best Fit Lines vs Year, Grouped by County",
x = "Year",
y = "Average Temperature (Celsius)")
## `geom_smooth()` using formula = 'y ~ x'
ggplot(california_county_climate, aes(x = YEAR, y = MAX.TEMP..C., color = COUNTY) ) +
geom_smooth(method = "lm", alpha = .15, aes(fill = COUNTY)) +
theme(legend.key.size = unit(0.3, "cm"), # Adjust the size of the legend keys (symbols)
legend.text = element_text(size = 6)) + # Adjust the size of the legend text
labs(title = "Graph of Maximum Temperature Best Fit Lines vs Year, Grouped by County",
x = "Year",
y = "Maximum Temperature (Celsius)")
## `geom_smooth()` using formula = 'y ~ x'
ggplot(california_county_climate, aes(x = YEAR, y = MIN.TEMP..C., color = COUNTY) ) +
geom_smooth(method = "lm", alpha = .15, aes(fill = COUNTY))+
theme(legend.key.size = unit(0.3, "cm"), # Adjust the size of the legend keys (symbols)
legend.text = element_text(size = 6)) + # Adjust the size of the legend text
labs(title = "Graph of Minimum Temperature Best Fit Lines vs Year, Grouped by County",
x = "Year",
y = "Minimum Temperature (Celsius)")
## `geom_smooth()` using formula = 'y ~ x'
It appears much easier to discern the trends in temperatures from Figures 16-18, with a large majority having an upward trend. However, it is still difficult to discern the trends (especially in the center) where the different lines (representing different counties) coincide with each other. At the same time, a lot of data/information is lost when we used smooth line graphs (i.e. lacks granularity).
County-Level Monthly Data
To dive in deeper and better observe the trends in each county, we used county-level monthly data to plot of the changes in temperature over time. Scatter plots, along with the facet_wrap() function, were used to view data of individual counties and prevent the over-cluttering of data points.
ggplot(california_county_climate, aes(x = YEAR, y = AVERAGE.TEMP..C., color = COUNTY)) +
geom_point() +
facet_wrap(~COUNTY, ncol = 29, scales = "fixed", shrink = TRUE) +
geom_smooth(method = "lm", se = FALSE, color = "black", size = 0.5, alpha = 0.6)+
theme(strip.text = element_text(size = 6), # Adjust the size of the facet labels and titles
axis.text.x = element_text(size = 3)) +
theme(legend.position = "none") +
labs(title = "Scatter Plot of Average Temperature vs Year, Grouped by County",
x = "Year",
y = "Average Temperature (Celsius)") +
theme(plot.title = element_text(hjust = 0.5))
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## `geom_smooth()` using formula = 'y ~ x'
ggplot(california_county_climate, aes(x = YEAR, y = MAX.TEMP..C., color = COUNTY))+
geom_point()+
facet_wrap(~COUNTY, ncol = 29, scales = "fixed", shrink = TRUE)+
geom_smooth(method = "lm", se = FALSE, color = "black", size = 0.5, alpha = 0.6)+
theme(strip.text = element_text(size = 6), # Adjust the size of the facet labels and titles
axis.text.x = element_text(size = 3))+
theme(legend.position = "none")+
labs(title = "Scatter Plot of Max Temperature vs Year, Grouped by County",
x = "Year",
y = "Maximum Temperature (Celsius)")+
theme(plot.title = element_text(hjust = 0.5))
## `geom_smooth()` using formula = 'y ~ x'
ggplot(california_county_climate, aes(x = YEAR, y = MIN.TEMP..C., color = COUNTY))+
geom_point()+
facet_wrap(~COUNTY, ncol = 29, scales = "fixed", shrink = TRUE)+
geom_smooth(method = "lm", se = FALSE, color = "black", size = 0.5, alpha = 0.6)+
theme(strip.text = element_text(size = 6), # Adjust the size of the facet labels and titles
axis.text.x = element_text(size = 3))+
theme(legend.position = "none")+
labs(title = "Scatter Plot of Min Temperature vs Year, Grouped by County",
x = "Year",
y = "Minimum Temperature (Celsius)")+
theme(plot.title = element_text(hjust = 0.5))
## `geom_smooth()` using formula = 'y ~ x'
Trends were much easier to discern from the above scatter plots. In general, there seems to be an upward trend, indicating a positive relationship between minimum, maximum or average temperature and time (year).
Since precipitation and temperatures that humans can feel are intrinsically linked through humidity, the investigations also covered if there might be a possible link between rising precipitation levels and suicide risks as well.
Statewide Yearly Data
Similar to the temperature data, scatter plots were used to observe trends in precipitation over time. Smooth regression and local regression were performed to see which regression model is a better fit for the dataset, and hence deduce the general trend over time.
ggplot(data=california_climate_overall, aes(x=YEAR, y=PRECIPITATION)) +
geom_line() +
geom_point(aes(color = PRECIPITATION)) +
scale_color_gradient(low = "blue", high = "red") +
labs(title="Graph of Precipitation (Inch) vs Year",
x ="Year", y = "Precipitation (Inch)", color = "Precipitation (Inch)") + #label of x-axis, y-axis and title
theme(plot.title = element_text(hjust = 0.5)) + #center title
stat_smooth(method = "lm",
formula = y ~ x,
geom = "smooth")
ggplot(data=california_climate_overall, aes(x=YEAR, y=PRECIPITATION)) +
geom_line() +
geom_point(aes(color = PRECIPITATION)) +
scale_color_gradient(low = "blue", high = "red") +
labs(title="Graph of Precipitation (Inch) vs Year",
x ="Year", y = "Precipitation (Inch)", color = "Precipitation (Inch)") + #label of x-axis, y-axis and title
theme(plot.title = element_text(hjust = 0.5)) + #center title
geom_smooth(method = "loess",
formula = y ~ x)
From Figure 22, a slight downward trend is observed when smooth regression is used, however, it is difficult to discern and confirm this trend. From Figure 23, the trend seems to be an inverse ‘U’ when local regression is used.
County-Level Yearly Data
Smooth line (best fit) graphs were used to observe general trends in each county.
ggplot(california_county_climate, aes(x = YEAR, y = PRECIPITATION..INCH., color = COUNTY) ) +
geom_smooth(method = "lm", alpha = .15, aes(fill = COUNTY), se = FALSE)+
theme(legend.key.size = unit(0.3, "cm"), # Adjust the size of the legend keys (symbols)
legend.text = element_text(size = 6)) + # Adjust the size of the legend text
labs(title = "Graph of Precipitation Best Fit Lines vs Year, Grouped by County",
x = "Year",
y = "Precipitation (Inch)")
## `geom_smooth()` using formula = 'y ~ x'
While smooth line graphs allow for better visualization of data, it is still difficult to discern the trends (especially in the center) where the different lines (representing different counties) coincide with each other. It is also noteworthy that there are opposing trends (i.e. some counties experience an increase in precipitation while some experiencing a decrease).
Scatter plots, along with facet_wrap() function, were used to view data of individual counties and prevent the over-cluttering of data points.
ggplot(data=california_county_climate, aes(x=YEAR, y=PRECIPITATION..INCH., color=PRECIPITATION..INCH.)) +
geom_point()+
geom_smooth(method = "lm", se = FALSE, color = "red", size = 0.5, alpha = 0.6)+
facet_wrap(~COUNTY, ncol = 29, scales = "fixed", shrink = TRUE)+
theme(strip.text = element_text(size = 6), # Adjust the size of the facet labels and titles
axis.text.x = element_text(size = 3))+
theme(legend.position = "none")+
labs(title = "Scatter Plot of Precipitation vs Year, Grouped by County",
x = "Year",
y = "Precipitation")+
theme(plot.title = element_text(hjust = 0.5))
## `geom_smooth()` using formula = 'y ~ x'
Given the scattered nature of the dataset, it is not easy to discern the different trends. However, based on the red best-fit lines, it reaffirms the idea that different counties are experiencing different precipitation trends over time.
Next, we looked at mental health data using the indicators - suicide rate (per 100,000) and the rate of hospitalization for mental health issues (per 1,000). These data include both statewide yearly data as well as county-level yearly data. We chose these two measures given their accessibility and availability.
Statewide Yearly Data
A line graph was used to observe the trend of suicide rate (per 100,000) over the years.
california_suicide_overall = read.csv("proj data/[Overall] California Suicide Incidence and Rate.csv")
ggplot(data=california_suicide_overall, aes(x=YEAR, y=SUICIDE.RATE..PER.100.000.)) +
geom_point(aes(color = SUICIDE.RATE..PER.100.000.))+
scale_color_gradient(low = "blue", high = "red")+
labs(title="Graph of Suicide Rate (per 100,000 people) vs Year",
x ="Year", y = "Suicide Rate (per 100,000 people)", color = "Suicide Rate (per 100,000 people)")+ #label of x-axis, y-axis and title
#theme(plot.title = element_text(hjust = 0.5))+ #center title
stat_smooth(method = "lm",
formula = y ~ x,
geom = "smooth",
alpha = 0.6,
size = 1,
se = F)+
xlim(1999, 2021)
From the best-fit line plotted, a general increasing trend in suicide rate (per 100,000 people) was observed over the years.
County-Level Yearly Data
Similar to previous sections, we explored other forms of data visualization to discern trends from county-level yearly data.
california_suicide_county = read.csv("proj data/[County Level] California County Suicide Rate.csv")
california_hospital_county = read.csv("proj data/County_Level_Hospitalizations_for_Mental_Health_Issues,_Age_5_19.csv")
One way is using line plots.
ggplot(data = california_suicide_county, aes(x = YEAR, y = SUICIDE.RATE..PER.100.000., group = COUNTY, color = as.factor(COUNTY))) +
geom_line() +
scale_color_discrete() + # Add more colors as needed
labs(title = "Suicide Rate (Per 100,000 People) vs Year",
x = "Year",
y = "Suicide Rate (Per 100,000 People)",
color = "County") +
theme(legend.key.size = unit(0.3, "cm"), # Adjust the size of the legend keys (symbols)
legend.text = element_text(size = 6)) + #, # Adjust the size of the legend text
#plot.title = element_text(hjust = 0.5))+
xlim(2010, 2019)+
scale_x_continuous(breaks = seq(2010, 2019, by = 1), labels = as.character(seq(2010, 2019, by = 1)))
## Scale for x is already present.
## Adding another scale for x, which will replace the existing scale.
ggplot(data = california_hospital_county, aes(x = YEAR, y = RATE.PER.1000.PEOPLE, group = COUNTY, color = as.factor(COUNTY))) +
geom_line() +
scale_color_discrete() + # Add more colors as needed
labs(title = "Hospitalization Rate (Per 1000 People) (5-19 years old) vs Year",
x = "Year",
y = "Rate of Hospitalization (Per 1000 People)",
color = "County") +
theme(legend.key.size = unit(0.3, "cm"), # Adjust the size of the legend keys (symbols)
legend.text = element_text(size = 6)) +#, # Adjust the size of the legend text
#plot.title = element_text(hjust = 0.5))+
xlim(2010, 2019)+
scale_x_continuous(breaks = seq(2002, 2020, by = 2), labels = as.character(seq(2002, 2020, by = 2)))
## Scale for x is already present.
## Adding another scale for x, which will replace the existing scale.
For the figures above, it is difficult to discern the trend given the fluctuations in data and the sheer number of counties involved.
To allow for better visualization of the data, smooth line plots were generated.
ggplot(california_suicide_county, aes(x = YEAR, y = SUICIDE.RATE..PER.100.000., color = COUNTY) ) +
geom_smooth(method = "lm", alpha = .15, aes(fill = COUNTY), se = F)+
theme(legend.key.size = unit(0.3, "cm"), # Adjust the size of the legend keys (symbols)
legend.text = element_text(size = 6)) + # Adjust the size of the legend text
labs(title = "Suicide Rate (Per 100,000 People) Best Fit Lines vs Year, Grouped by County",
x = "Year",
y = "Suicide Rate (Per 100,000 People)")+
scale_x_continuous(breaks = seq(2010, 2019, by = 1), labels = as.character(seq(2010, 2019, by = 1)))
## `geom_smooth()` using formula = 'y ~ x'
ggplot(data = california_hospital_county, aes(x = YEAR, y = RATE.PER.1000.PEOPLE, group = COUNTY, color = as.factor(COUNTY))) +
geom_smooth(method = "lm", alpha = .15, se = F) +
scale_color_discrete() + # Add more colors as needed
labs(title = "Hospitalization Rate (Per 1000 People) (5-19 years old) vs Year",
x = "Year",
y = "Rate of Hospitalization for Mental Health Issues (Per 1000 People)",
color = "County") +
theme(legend.key.size = unit(0.3, "cm"), # Adjust the size of the legend keys (symbols)
legend.text = element_text(size = 6), # Adjust the size of the legend text
axis.text = element_text(size = 6))+
xlim(2010, 2019)+
scale_x_continuous(breaks = seq(2002, 2020, by = 1), labels = as.character(seq(2002, 2020, by = 1)))
## Scale for x is already present.
## Adding another scale for x, which will replace the existing scale.
## `geom_smooth()` using formula = 'y ~ x'
We noted from the figures above that it is much easier to discern the trends using this type of visualization. Different counties have different trends, as seen by the combination of upward and downward lines.
Scatter plots, along with the facet_wrap() function, were used to view data of individual counties and prevent the over-cluttering of data points.
ggplot(california_suicide_county, aes(x = YEAR, y = SUICIDE.RATE..PER.100.000., color = COUNTY))+
geom_point()+
facet_wrap(~COUNTY, ncol = 29, scales = "fixed", shrink = TRUE)+
geom_smooth(method = "lm", se = FALSE, color = "black", size = 0.5, alpha = 0.6)+
theme(strip.text = element_text(size = 6), # Adjust the size of the facet labels and titles
axis.text.x = element_text(size = 3))+
theme(legend.position = "none")+
labs(title = "Suicide Rates (Per 100,000 People) vs Year, Grouped by County",
x = "Year",
y = "Suicide Rates (Per 100,000 People)") +
theme(plot.title = element_text(hjust = 0.5))
## `geom_smooth()` using formula = 'y ~ x'
ggplot(california_hospital_county, aes(x = YEAR, y = RATE.PER.1000.PEOPLE, color = COUNTY))+
geom_point()+
geom_smooth(aes(group = COUNTY), method = "lm", se = FALSE, color = "black", size = 0.5, alpha = 0.6)+
facet_wrap(~ COUNTY, ncol = 29, scales = "fixed", shrink = TRUE) + # Adjust the number of columns as per your preference
theme(strip.text = element_text(size = 6), # Adjust the size of the facet labels and titles
axis.text.x = element_text(size = 3))+
theme(legend.position = "none")+
labs(title = "Hospitalization Rate (Per 1000 People) vs Year, Grouped by County",
x = "Year",
y = "Rate of Hospitalization for Mental Health Issues (Per 1000 People)") +
theme(plot.title = element_text(hjust = 0.5))
## `geom_smooth()` using formula = 'y ~ x'
Figures 31 and 32 reaffirm that different counties have different trends, as seen by the combination of upward and downward trends. It is also seen that a considerable number of counties seems to exhibit a positive trend (for hospitalization) - with rate of hospitalization increasing with time.
To understand the relationship/association between temperature and mental health, correlation graphs were used to summarize the degree and direction of correlation between the different variables. This was first done using statewide yearly data.
combined_data = read.csv("proj data/[Combined] Calfornia County Suicide and Temperature Data.csv")
correlation_matrix_county <- cor(combined_data)
ggcorrplot(correlation_matrix_county, type = "full", outline.color = "white",
colors = c("#FC4E07", "white", "#00AFBB"),
lab = TRUE,
lab_size = 1.5,
tl.cex = 6)
ggcorrplot(correlation_matrix_county, method = "circle",
colors = c("#FC4E07", "white", "#00AFBB"),
tl.cex = 6)
From the above correlation graphs, we see that the correlation between temperature measures and suicide is positive but weak, while the correlation between precipitation and suicide is negative but weak.
However, if we look at the changes in these variables, we see that there is a negative but weak correlation between change in temperature measures and change in suicide rates. The correlation between change in precipitation and change in suicide rates is negative but very weak.
Similarly, correlation graphs were plotted using county-level yearly data, to observe the correlation at county-level.
combined_data_overall = read.csv("proj data/[Combined] California Overall Suicide and Temperature.csv")
#View(combined_data_overall)
correlation_matrix_overall <- cor(combined_data_overall)
#View(correlation_matrix)
#str(combined_data_overall)
ggcorrplot(correlation_matrix_overall, type = "full", outline.color = "white",
colors = c("#FC4E07", "white", "#00AFBB"),
lab = TRUE,
lab_size = 1.5,
tl.cex = 6)
ggcorrplot(correlation_matrix_overall, method = "circle",
colors = c("#FC4E07", "white", "#00AFBB"),
tl.cex = 6)
Unlike the plots that used statewide data, the correlation graphs from the county-level data showed a negative but weak correlation between temperature measures and suicide. The correlation between precipitation and suicide was still seen to be negative but weak. However, such comparison does not take into consideration that different counties have different baseline temperatures
If we look at the changes in this variables, we see that there is a positive but very weak correlation between the change in temperature measures and the change in suicide rates. The correlation between the change in precipitation and the change in suicide rates positive but very weak.
Data wrangling was done to consolidate the different datasets before we could explore the county-level data deeper. This includes joining the datasets, removing unnecessary data points (eg. NA rows) and renaming columns for easier referencing.
df1 <- read.csv("proj data/County_Level_California_Avg,_Mix,_Max_Temp_and_Precipitation.csv")
df2 <- read.csv("proj data/[County Level] California County Suicide Rate.csv")
df3 <- read.csv("proj data/County_Level_Hospitalizations_for_Mental_Health_Issues,_Age_5_19.csv")
#full <- read.csv("proj data/[Combined] Calfornia County Suicide and Temperature Data.csv")
full_county <- list(df2,df3,df1) %>%
reduce(left_join, by = c("YEAR", "COUNTY"))
#head(full_county)
#full_county <- subset(full_county, select = -c(X, X.1))
colnames(full_county) <- colnames(full_county) %>%
str_replace_all("\\.", "_") %>%
tolower()
colnames(full_county) = gsub("_$", "", colnames(full_county))
colnames(full_county) = gsub("__", "_", colnames(full_county))
colnames(full_county)[colnames(full_county) == "rate_per_1000_people"] = "hospitalizations_rate_per_1000_people"
colnames(full_county)[colnames(full_county) == "change_in_rate"] = "change_in_h_rate"
full_county <- na.omit(full_county)
#head(full_county)
A preliminary plot was generated to understand the general relationship (degree and direction) between the variables of interest. We looked at the average temperature, precipitation, suicide rate (per 100,000) and hospitalization rate (per 1000) in one plot, and the change in these variables in another plot.
preliminary_plot1 <- subset(full_county, select = c(5,7,10,15))
#head(preliminary_plot1)
plot(preliminary_plot1, main = "Relations between Variables")
We were interested to understand more about the relationship between the independent variables (temperature and precipitation) and the dependent variables (mental health data). From the first preliminary plot, the trends for certain relationships (eg. precipitation and suicide rate) were more distinct, while others (eg. temperature vs hospitalization rate) were inconclusive.
preliminary_plot2 <- subset(full_county, select = c(6,8,16,19))
#head(preliminary_plot2)
plot(preliminary_plot2, main = "Relations between Change in Variables")
Compared to the first preliminary plot, the second preliminary plot showed slightly clearer trends. For example, a seemingly linear relationship can be seen between the change in suicide rate and the change in average temperature.
Next, scatter plots were done for each pair of variables, along with a regression line to better visualize the relationship between two variables. We also experimented with the visualizations by using colours for different counties and also by using the facet_wrap() function.
3.2.1 Average Temperature vs Change in Suicide Rate
ggplot(full_county, aes(x=change_in_avg_temp, y=change_in_suicide_rate, color = county)) +
geom_point(size=1.2) +
geom_line(linewidth=0.7) +
theme(legend.position = "none") +
labs(x = "Change in Average Temp (C)", y = "Change in Suicide Rate (per 100,000 people)", title = "Change in Suicide Rate against Change in Average Temp (C)")
It is difficult to distinguish clear relationship from the above plot, due to the overlapping points and large number of counties. Hence, it may be better to visualize using the facet_wrap() function, and be using regression lines to observe overall trend for each county.
ggplot(full_county, aes(x=change_in_avg_temp, y=change_in_suicide_rate)) +
geom_point() +
geom_smooth(aes(group=county), method = "lm", se = FALSE, color = "red", linewidth = 1, alpha = 0.6) +
facet_wrap(~county) +
theme(legend.position = "none") +
labs(x = "Change in Average Temp (C)", y = "Change in Suicide Rate (per 100,000 people)", title = "Change in Suicide Rate against Change in Average Temp (C)")
## `geom_smooth()` using formula = 'y ~ x'
From the straight-line linear model, there is a lack of correlation seen in most counties. In some counties (e.g. Amador, Fresno), there seems to be slight level of regression, however the direction seems to vary and the magnitude of change is generally small.
We could also use a linear model analysis to statistically verify the suitability of using a linear model to fit the data points.
suicide_lm <- lm(change_in_suicide_rate~change_in_avg_temp, full_county)
summary(suicide_lm)
##
## Call:
## lm(formula = change_in_suicide_rate ~ change_in_avg_temp, data = full_county)
##
## Residuals:
## Min 1Q Median 3Q Max
## -91.453 -2.133 0.023 1.891 91.568
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.1815 0.4350 0.417 0.677
## change_in_avg_temp 0.4550 0.6597 0.690 0.491
##
## Residual standard error: 9.728 on 498 degrees of freedom
## Multiple R-squared: 0.0009543, Adjusted R-squared: -0.001052
## F-statistic: 0.4757 on 1 and 498 DF, p-value: 0.4907
When we used the county-level dataset as a whole, we saw that there is a very slight positive gradient (0.08803) in the relationship between change in temperature and change in suicide rates. From the p-value (0.0338), we conclude that at 95% significance level, this gradient value is statistically significant. However, the low R^2 value of 0.009015 suggests that change in temperature alone does not explain much of the suicide rates in California.
We could use smooth regression to better account for variations in trends (eg. is there a range of temperatures where effect on suicide rates is larger).
ggplot(full_county, aes(x=change_in_avg_temp, y=change_in_suicide_rate)) +
geom_point() +
geom_smooth(aes(group=county), method = "loess", se = FALSE, color = "red", linewidth = 1, alpha = 0.6) +
facet_wrap(~county) +
theme(legend.position = "none") +
labs(x = "Change in Average Temp (C)", y = "Change in Suicide Rate (per 100,000 people)", title = "Change in Suicide Rate against Change in Average Temp (C)")
## `geom_smooth()` using formula = 'y ~ x'
We observed that the use of smooth regression allows us to visualize the relationship more accurately. It is also interesting to note that the relationship is not necessarily strictly increasing or decreasing for most counties.
3.2.2 Average Temperature vs Change in Hospitalization Rate
ggplot(full_county, aes(x=change_in_avg_temp, y=change_in_h_rate, group = county, color = as.factor(county))) +
geom_point(size=1) +
geom_line() +
labs(y = "Change in Hospitalization Rate (per 1000 people)", title = "Change in Hospitalization Rate against Change in Average Temp (C)") +
scale_x_continuous(name = "Change in Average Temp (C)") +
theme(legend.position = "none")
Similar to the data for change in suicide rates, it is difficult to discern the relationship from the above plot, due to the overlapping points and large number of counties. Hence, it may be better to visualize using the facet_wrap() function, and be using regression lines to observe overall trend for each county.
ggplot(full_county, aes(x=change_in_avg_temp, y=change_in_h_rate)) +
geom_point(size = 1) +
geom_smooth(aes(group=county), method = "lm", se = FALSE, color = "red", linewidth = 1, alpha = 0.6) +
labs(y = "Change in Hospitalization Rate (per 1000 people)", title = "Change in Hospitalization Rate against Change in Average Temp (C)") +
facet_wrap(~county) + # too many counties to see clearly
scale_x_continuous(name = "Change in Average Temp (C)") +
theme(legend.position = "none")
## `geom_smooth()` using formula = 'y ~ x'
The linear model shows a lack of correlation, as seen in most counties. In some counties (e.g. Alpine, Mono), there seems to be slight level of regression, however the direction seems to vary and the magnitude is generally small. Likewise, we also ran a linear model analysis for the change in hospitalization rates data.*
hospitalization_lm <- lm(change_in_h_rate~change_in_avg_temp, full_county)
summary(hospitalization_lm)
##
## Call:
## lm(formula = change_in_h_rate ~ change_in_avg_temp, data = full_county)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.53961 -0.37606 -0.03169 0.32826 2.68292
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.18803 0.02728 6.894 1.65e-11 ***
## change_in_avg_temp 0.08803 0.04136 2.128 0.0338 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6099 on 498 degrees of freedom
## Multiple R-squared: 0.009015, Adjusted R-squared: 0.007025
## F-statistic: 4.53 on 1 and 498 DF, p-value: 0.03379
We saw a slight positive gradient (0.455) in the relationship between change in temperature and change in hospitalization rates. From the p-value (0.4907), we conclude that at 95% significance level, this gradient value is not statistically significant. The low R^2 value of 0.0009543 also suggests that the change in temperature alone does not explain much of the change in hospitalization rates in California.
ggplot(full_county, aes(x=change_in_avg_temp, y=change_in_h_rate)) +
geom_point(size = 1) +
geom_smooth(aes(group=county), method = "loess", se = FALSE, color = "red", linewidth = 1, alpha = 0.6) +
labs(y = "Change in Hospitalization Rate (per 1000 people)", title = "Change in Hospitalization Rate against Change in Average Temp (C)") +
facet_wrap(~county) + # too ma y counties to see clearly
scale_x_continuous(name = "Change in Average Temp (C)") +
theme(legend.position = "none")
## `geom_smooth()` using formula = 'y ~ x'
We observed that using smooth regression allows us to visualize the relationship more accurately. It is also interesting to note that the relationship is not necessarily strictly increasing or decreasing for most counties.
Sentiment analysis of text data can allow us to understand more about how the population views the issue of rising temperatures and can potentially shed light on the mental health of the population as a whole. To gain a better understanding of these areas, we looked into specific words in Google trends, such as “heat” and “insomnia”, terms that can provide us some clues on the potential effects of rising temperatures and whether there is actually increasing interest in the rising temperature.
Increasing searches or interest in keywords may denote increasing concern and worry about the rising temperature. At the same time, increasing searches for key terms like insomnia may suggest an increasing number of people who suffer from poor sleep quality, which could be due to the rising temperatures, as purported by Lohmus (2018). To visualize the effects, we plotted several line graphs of search counts versus time with the data collected.
Trends that increase with rising temperatures over the year can signify areas of increasing concerns/interests that warrant further investigation.
4.1.1 Keyword: Heat
california_searchterm_heat = read.csv("proj data/[Google Trend] Word_ Heat - California.csv")
california_searchterm_heat$Date <- as.Date(paste0(california_searchterm_heat$Date, "-01"))
ggplot(data = california_searchterm_heat, aes(x = Date, y = Count, group = ID))+
geom_line(aes(color = ID))+
geom_point(aes(color = ID))+
geom_smooth(method = "lm", alpha = .15, se = F, size = 0.5)+
labs(title = "Line Plot of Google Trend Search of Heat",
x = "Date",
y = "Count")+
theme(plot.title = element_text(hjust = 0.5))
## `geom_smooth()` using formula = 'y ~ x'
From the plots, we can see that there is a positive correlation between rising temperatures and some of the possible consequences that we looked at. However, not all of the possible correlations observed stem from data that is relevant to the negative impacts of rising temperature. For example, a term “Miami Heat” refers to a basketball team rather than the rising temperatures or a possible side effect of rising temperatures. This may suggests that we need a more specific keyword for analysis.
4.1.2 Keyword: Heat Injury
california_searchterm_heatinjury = read.csv("proj data/[Google Trend] Word_ Heat Injury - California.csv")
california_searchterm_heatinjury$Date <- as.Date(paste0(california_searchterm_heatinjury$Date, "-01"))
ggplot(data = california_searchterm_heatinjury, aes(x = Date, y = Count))+
geom_line(aes(color = Date))+
geom_point(aes(color = Date))+
geom_smooth(method = "lm", alpha = .15, se = F, size = 0.6)+
labs(title = "Line Plot of Google Trend Search of Heat Injury",
x = "Date",
y = "Count")+
theme(plot.title = element_text(hjust = 0.5))
## `geom_smooth()` using formula = 'y ~ x'
One of the line graphs obtained from the data collected from Google Trends seems to suggest increase in concerns over the risk of heat injuries in California. This could signal increasing worries about increasing temperature.
4.1.3 Keyword: High Temperature
california_searchterm_hightemperature = read.csv("proj data/[Google Trend] Word_ High Temperature - California.csv")
california_searchterm_hightemperature$Date <- as.Date(paste0(california_searchterm_hightemperature$Date, "-01"))
ggplot(data = california_searchterm_hightemperature, aes(x = Date, y = Count))+
geom_line(aes(color = Date))+
geom_point(aes(color = Date))+
geom_smooth(method = "lm", alpha = .15, se = F, size = 0.6)+
labs(title = "Line Plot of Google Trend Search of High Temperature",
x = "Date",
y = "Count")+
theme(plot.title = element_text(hjust = 0.5))+
geom_vline(xintercept = as.numeric(as.Date("2022-09-01")), linetype = "dashed", color = "red", size = 0.6, alpha = 0.6)+
geom_vline(xintercept = as.numeric(as.Date("2006-07-01")), linetype = "dashed", color = "red", size = 0.6, alpha = 0.6)+
annotate("text", x = as.Date("2019-04-01"), y = 55,
label = "Record-breaking heat wave", vjust = -1, color = "red", size = 3)+
annotate("text", x = as.Date("2010-04-01"), y = 55,
label = "2006 North American heat wave", vjust = -1, color = "red", size = 3)
## `geom_smooth()` using formula = 'y ~ x'
After comparing the data collected from Google Trends for both the
search terms “Heat Injury” and “High temperatures”, we can observe
increasing trends in both “Heat Injury” and “High temperatures” with
respect to time. This suggests that there is a strong positive
correlation between “Heat Injury” and “High temperatures”.
Some of the spikes in on the “Heat Injury” line plot also corresponded
with some of worst heatwaves in California that occurred in the years of
2006 and 2022.
4.1.4 Keyword: Insomnia
california_searchterm_insomnia = read.csv("proj data/[Google Trend] Word_ Insomnia - California.csv")
california_searchterm_insomnia$Date <- as.Date(paste0(california_searchterm_insomnia$Date, "-01"))
ggplot(data = california_searchterm_insomnia, aes(x = Date, y = Count))+
geom_line(aes(color = Date))+
geom_point(aes(color = Date))+
geom_smooth(method = "lm", alpha = .15, se = F, size = 0.6)+
labs(title = "Line Plot of Google Trend Search of Insomnia",
x = "Date",
y = "Count")+
theme(plot.title = element_text(hjust = 0.5))
## `geom_smooth()` using formula = 'y ~ x'
Another possible noteworthy consequence of rising temperatures based
on our literature review was the negative impact on sleep quality
(Henane, R. et al, 1977)(https://pubmed.ncbi.nlm.nih.gov/188799/). There is also
a notably increasing trend in the amount of searches for the term
“insomnia”, suggesting that an increasing number of people have become
afflicted with the condition or are suffering from poor sleep
quality.
However, we recognize that poor sleep quality could also be attributed
to a combination of other factors, like stresses from either work or
school and relationship problems.
4.1.5 Keyword: Climate Anxiety
california_searchterm_climateAnxiety = read.csv("proj data/[Google Trend] Word_ Climate_Anxiety - California.csv")
california_searchterm_climateAnxiety$Date <- as.Date(paste0(california_searchterm_climateAnxiety$Date, "-01"))
ggplot(data = california_searchterm_climateAnxiety, aes(x = Date, y = Count))+
geom_line(aes(color = Date))+
geom_point(aes(color = Date))+
geom_smooth(method = "lm", alpha = .15, se = F, size = 0.6)+
labs(title = "Line Plot of Google Trend Search of Climate Anxiety",
x = "Date",
y = "Count")+
theme(plot.title = element_text(hjust = 0.5))
## `geom_smooth()` using formula = 'y ~ x'
The best fit line seems to suggest a downward trend. However, the general trend exhibited appears unremarkable, with little to no changes from 2010 onwards. It is interesting to note that that there is a large search count for “climate anxiety” around 2005 and a relatively low search count after. We speculate that it could be due to changes in data collection methodologies.
4.1.5 Keyword: Climate Depression
california_searchterm_climateDepression = read.csv("proj data/[Google Trend] Word_ Climate_Depression - California.csv")
california_searchterm_climateDepression$Date <- as.Date(paste0(california_searchterm_climateDepression$Date, "-01"))
ggplot(data = california_searchterm_climateDepression, aes(x = Date, y = Count))+
geom_line(aes(color = Date))+
geom_point(aes(color = Date))+
geom_smooth(method = "lm", alpha = .15, se = F, size = 0.6)+
labs(title = "Line Plot of Google Trend Search of Climate Depression",
x = "Date",
y = "Count")+
theme(plot.title = element_text(hjust = 0.5))
## `geom_smooth()` using formula = 'y ~ x'
We noticed a similarly unremarkable trend in this plot, with little to no changes in the trend from 2010 onwards. Similarly, there is a large search count between 2005 and 2007 and relatively low search count after.
We also extracted data from Reddit and conducted analysis for a few keywords like “heat”, “hot”, and “warm” over the different lexicons.
4.2.1 Keyword: Heat
#heat_urls <- find_thread_urls(keywords = "heat", subreddit="California", period = "all")
#heat_comments <- get_thread_content(heat_urls$url)
heat_comments <- read.csv("proj data/heat_comments_reddit.csv")
#head(heat_comments)
tidy_comments <- heat_comments %>%
unnest_tokens(word, comment) %>%
anti_join(stop_words)
## Joining, by = "word"
tidy_comments %<>%
anti_join(stop_words)
## Joining, by = "word"
#View(tidy_comments)
afinn <- tidy_comments %>%
inner_join(get_sentiments("afinn")) %>%
group_by(index = timestamp %/% 604800) %>%
summarise(sentiment = sum(value)) %>%
mutate(method = "AFINN")
## Joining, by = "word"
bing <- tidy_comments %>%
inner_join(get_sentiments("bing")) %>%
mutate(method = "Bing et al.") %>%
count(method, index = timestamp %/% 604800, sentiment) %>%
pivot_wider(names_from = sentiment,
values_from = n,
values_fill = 0) %>%
mutate(sentiment = positive - negative)
## Joining, by = "word"
nrc <- tidy_comments %>%
inner_join(get_sentiments("nrc") %>%
# only get positive and negative sentiments
filter(sentiment %in% c("positive", "negative"))) %>%
mutate(method = "NRC") %>%
count(method, index = timestamp %/% 604800, sentiment) %>%
pivot_wider(names_from = sentiment,
values_from = n,
values_fill = 0) %>%
mutate(sentiment = positive - negative)
## Joining, by = "word"
all_three <- bind_rows(afinn,
bing,
nrc)
ggplot(all_three, aes(index, sentiment, fill = method)) +
geom_col(aes(fill = method))+
facet_wrap(~method, ncol = 1, scales = "free_y")
Sentiment analysis of the keyword “heat” shows a relatively sparse dataset, especially during the earlier years. There is a lack of general trends to be drawn from the sentiment analysis, even though we can see that there are more extreme positive and negative sentiments regarding the search term “heat” in the later years.
nrc_sentiment <- tidy_comments %>%
inner_join(get_sentiments("nrc"))
## Joining, by = "word"
nrc_sentiment %>%
select(word, sentiment) %>%
head()
## word sentiment
## 1 author positive
## 2 author trust
## 3 tree anger
## 4 tree anticipation
## 5 tree disgust
## 6 tree joy
table(nrc_sentiment$sentiment)
##
## anger anticipation disgust fear joy negative
## 762 947 437 876 770 1506
## positive sadness surprise trust
## 2185 720 470 1272
ggplot(nrc_sentiment, aes(y = sentiment))+
geom_bar(aes(fill = sentiment))+
theme_minimal()+
labs(title = "NRC Sentiments in California Subreddit Posts - Heat")
4.2.2 Keyword: Warm
#warm_urls <- find_thread_urls(keywords = "warm", subreddit="California", period = "all")
#warm_comments <- get_thread_content(warm_urls$url)
warm_comments <- read.csv("proj data/warm_comments_reddit.csv")
tidy_comments <- warm_comments %>%
unnest_tokens(word, comment) %>%
anti_join(stop_words)
## Joining, by = "word"
tidy_comments %<>%
anti_join(stop_words)
## Joining, by = "word"
#View(tidy_comments)
afinn <- tidy_comments %>%
inner_join(get_sentiments("afinn")) %>%
group_by(index = timestamp %/% 604800) %>%
summarise(sentiment = sum(value)) %>%
mutate(method = "AFINN")
## Joining, by = "word"
bing <- tidy_comments %>%
inner_join(get_sentiments("bing")) %>%
mutate(method = "Bing et al.") %>%
count(method, index = timestamp %/% 604800, sentiment) %>%
pivot_wider(names_from = sentiment,
values_from = n,
values_fill = 0) %>%
mutate(sentiment = positive - negative)
## Joining, by = "word"
nrc <- tidy_comments %>%
inner_join(get_sentiments("nrc") %>%
# only get positive and negative sentiments
filter(sentiment %in% c("positive", "negative"))) %>%
mutate(method = "NRC") %>%
count(method, index = timestamp %/% 604800, sentiment) %>%
pivot_wider(names_from = sentiment,
values_from = n,
values_fill = 0) %>%
mutate(sentiment = positive - negative)
## Joining, by = "word"
all_three <- bind_rows(afinn,
bing,
nrc)
ggplot(all_three, aes(index, sentiment, fill = method)) +
geom_col(aes(fill = method))+
facet_wrap(~method, ncol = 1, scales = "free_y")
Sentiment analysis of the keyword “warm” shows a relatively sparse dataset, especially during the earlier years. Similar to the sentiment analysis for the keyword “hot”, there is a lack of general trends to be drawn from the sentiment analysis, even though we can see that there are more extreme positive and negative sentiments regarding the search term “heat” in the later years, which is also what we observed for the sentiment analysis for the keyword “warm”.
nrc_sentiment <- tidy_comments %>%
inner_join(get_sentiments("nrc"))
## Joining, by = "word"
nrc_sentiment %>%
select(word, sentiment) %>%
head()
## word sentiment
## 1 juvenile negative
## 2 love joy
## 3 love positive
## 4 sting anger
## 5 sting fear
## 6 sting negative
table(nrc_sentiment$sentiment)
##
## anger anticipation disgust fear joy negative
## 610 834 487 538 831 1227
## positive sadness surprise trust
## 1969 440 365 1094
ggplot(nrc_sentiment, aes(y = sentiment))+
geom_bar(aes(fill = sentiment))+
theme_minimal()+
labs(title = "NRC Sentiments in California Subreddit Posts - Warm")
4.2.3 Keyword: Hot
#hot_urls <- find_thread_urls(keywords = "hot", subreddit="California", period = "all")
#hot_comments <- get_thread_content(hot_urls$url)
hot_comments <- read.csv("proj data/hot_comments_reddit.csv")
tidy_comments <- hot_comments %>%
unnest_tokens(word, comment) %>%
anti_join(stop_words)
## Joining, by = "word"
tidy_comments %<>%
anti_join(stop_words)
## Joining, by = "word"
#View(tidy_comments)
afinn <- tidy_comments %>%
inner_join(get_sentiments("afinn")) %>%
group_by(index = timestamp %/% 604800) %>%
summarise(sentiment = sum(value)) %>%
mutate(method = "AFINN")
## Joining, by = "word"
bing <- tidy_comments %>%
inner_join(get_sentiments("bing")) %>%
mutate(method = "Bing et al.") %>%
count(method, index = timestamp %/% 604800, sentiment) %>%
pivot_wider(names_from = sentiment,
values_from = n,
values_fill = 0) %>%
mutate(sentiment = positive - negative)
## Joining, by = "word"
nrc <- tidy_comments %>%
inner_join(get_sentiments("nrc") %>%
# only get positive and negative sentiments
filter(sentiment %in% c("positive", "negative"))) %>%
mutate(method = "NRC") %>%
count(method, index = timestamp %/% 604800, sentiment) %>%
pivot_wider(names_from = sentiment,
values_from = n,
values_fill = 0) %>%
mutate(sentiment = positive - negative)
## Joining, by = "word"
all_three <- bind_rows(afinn,
bing,
nrc)
ggplot(all_three, aes(index, sentiment, fill = method)) +
geom_col(aes(fill = method))+
facet_wrap(~method, ncol = 1, scales = "free_y")
Sentiment analysis of the keyword “hot” also shows a relatively sparse dataset, especially during the earlier years, similar to the keywords “heat” and “warm”. However, unlike the earlier two keywords, sentiment analysis yields largely positive sentiments regardless of the lexicon used. This might be due to the many ways that Reddit users could use the word “hot”, that might be associated with positive sentiments, that may not be relevant for this investigation.
nrc_sentiment <- tidy_comments %>%
inner_join(get_sentiments("nrc"))
## Joining, by = "word"
nrc_sentiment %>%
select(word, sentiment) %>%
head()
## word sentiment
## 1 beach joy
## 2 start anticipation
## 3 favorite joy
## 4 favorite positive
## 5 favorite trust
## 6 seals trust
table(nrc_sentiment$sentiment)
##
## anger anticipation disgust fear joy negative
## 49 71 26 65 61 111
## positive sadness surprise trust
## 157 44 25 99
ggplot(nrc_sentiment, aes(y = sentiment))+
geom_bar(aes(fill = sentiment))+
theme_minimal()+
labs(title = "NRC Sentiments in California Subreddit Posts - Hot")
4.2.4 Keyword: Temperature
#temperature_urls <- find_thread_urls(keywords = "temperature", subreddit="California", period = "all")
#temperature_comments <- get_thread_content(temperature_urls$url)
temperature_comments <- read.csv("proj data/temperature_comments_reddit.csv")
tidy_comments <- temperature_comments %>%
unnest_tokens(word, comment) %>%
anti_join(stop_words)
## Joining, by = "word"
tidy_comments %<>%
anti_join(stop_words)
## Joining, by = "word"
#View(tidy_comments)
afinn <- tidy_comments %>%
inner_join(get_sentiments("afinn")) %>%
group_by(index = timestamp %/% 604800) %>%
summarise(sentiment = sum(value)) %>%
mutate(method = "AFINN")
## Joining, by = "word"
bing <- tidy_comments %>%
inner_join(get_sentiments("bing")) %>%
mutate(method = "Bing et al.") %>%
count(method, index = timestamp %/% 604800, sentiment) %>%
pivot_wider(names_from = sentiment,
values_from = n,
values_fill = 0) %>%
mutate(sentiment = positive - negative)
## Joining, by = "word"
nrc <- tidy_comments %>%
inner_join(get_sentiments("nrc") %>%
# only get positive and negative sentiments
filter(sentiment %in% c("positive", "negative"))) %>%
mutate(method = "NRC") %>%
count(method, index = timestamp %/% 604800, sentiment) %>%
pivot_wider(names_from = sentiment,
values_from = n,
values_fill = 0) %>%
mutate(sentiment = positive - negative)
## Joining, by = "word"
all_three <- bind_rows(afinn,
bing,
nrc)
ggplot(all_three, aes(index, sentiment, fill = method)) +
geom_col(aes(fill = method))+
facet_wrap(~method, ncol = 1, scales = "free_y")
Sentiment analysis of the keyword “temperature” also shows a relatively sparse dataset, especially during the earlier years, similar to the keywords “heat” and “warm”. Similar to the keyword “hot”, sentiment analysis yields largely positive sentiments regardless of the lexicon, which might be due to alternative meanings that extend beyond the scope of our investigations as well. The magnitudes of both positive and negative sentiments seem to taper off through the years here, as opposed to the keyword “hot”.
nrc_sentiment <- tidy_comments %>%
inner_join(get_sentiments("nrc"))
## Joining, by = "word"
nrc_sentiment %>%
select(word, sentiment) %>%
head()
## word sentiment
## 1 juvenile negative
## 2 love joy
## 3 love positive
## 4 sting anger
## 5 sting fear
## 6 sting negative
table(nrc_sentiment$sentiment)
##
## anger anticipation disgust fear joy negative
## 844 1031 643 832 760 1669
## positive sadness surprise trust
## 2046 698 517 1100
ggplot(nrc_sentiment, aes(y = sentiment))+
geom_bar(aes(fill = sentiment))+
theme_minimal()+
labs(title = "NRC Sentiments in California Subreddit Posts - Temperature")
Sentiment analysis (with respect to the NRC lexicon) for the four keywords show similar results, with the strongest sentiment being “positive” for all four keywords, with the next strongest sentiment being “negative” sentiment.
ProQuest TDM was also used to generate visualisations on Geographical Analysis and Sentiment Analysis for the search term “Rising Temperatures”. Our visualisations are based on a dataset generated from a compilation of Los Angeles Times articles from 1886-1922, and 1923-1995, and Los Angeles Sentinel articles from 1934. The dataset has a total of 8901 documents.
Based on the ProQuest Geographical Analysis, most of the news that matched the search term came from California. From this, it is fair to conclude that most of the sentiment analysis regarding “rising temperatures” comes from Californian residents.
Based on the ProQuest Sentiment Analysis, we can look at the emotions
associated with mental health:
- [Fear] Difficult to discern a clear trend as it seems
to be fluctuating from 1920 to 2020
- [Happiness] Difficult to discern a clear trend as it
seems to be fluctuating. However, there seems to be a steady increase
from 1980 onwards
- [Sadness] Difficult to discern a clear trend as it
seems to be fluctuating. However, there seems to be a steady increase
from 1980 onwards, similar to happiness.
| No. | Final summary on the methods used in the investigation: |
|---|---|
| 1. | Literature review during the initial investigations highlighted that there were two conflicting views involved in the analysis of the impact of rising temperatures. |
| 2. | Visualisations and investigation of certain trends using ggplot did find a general increase in the trends of both temperatures and their associated impacts. |
| 3. | Correlation studies proved positive, but weak, correlation between rising temperatures and suicide rates in California. |
| 4. | Sentiment analysis have limited usefulness in determining emotions associated with high temperatures with respect to time, but does also provide some insights to sentiments and concerns regarding rising temperatures and clues to how they may link to mental health. |
6.1.1 Spatial Analysis: Suicide Rates vs Average Temperature
Suicide Rates
counties <- counties(state = "CA")
## Retrieving data for the year 2021
#ggplot(counties)+
# geom_sf()
california_suicide_spatial = read.csv("proj data/[County Level] California County Suicide Rate.csv")
counties %<>%
left_join(california_suicide_spatial, by = c("NAME" = "COUNTY"))
counties %<>%
drop_na("YEAR") #drop any NA values
#View(counties)
ggplot(counties)+
geom_sf(aes(fill = `SUICIDE.RATE..PER.100.000.`))+
scale_fill_gradient(high = "Blue", low = "White")+
facet_wrap(~YEAR, ncol = 5)+
labs(title = "Spatial Plot of California counties representing Suicide Rates over the years", fill = "Suicide Rate (Per 100,000 People)") +
theme(axis.text = element_text(size = 6))
#creating different ggplots showcasing suicide rates per 100,000, separated by year
Nothing really remarkable, just that Northern area seems to suffer a higher rate of suicides as compared to the Southern area, as evident by the whiter shade at the South and redder shade at the North from 2010 to 2019.
Average Temperature
counties_1 <- counties(state = "CA")
## Retrieving data for the year 2021
#ggplot(counties_1)+
# geom_sf()
california_temperature_spatial = read.csv("proj data/County_Level_California_Avg,_Mix,_Max_Temp_and_Precipitation.csv")
counties_1 %<>%
left_join(california_temperature_spatial, by = c("NAME" = "COUNTY"))
counties_1 %<>%
drop_na("YEAR") %>% #drop any NA values
filter(YEAR >= 2010 & YEAR <= 2019)
#View(counties_1)
ggplot(counties_1)+
geom_sf(aes(fill = `AVERAGE.TEMP..C.`))+
scale_fill_gradient(high = "Red", low = "White")+
facet_wrap(~YEAR, ncol = 5)+
labs(title = "Spatial Plot of California counties representing Average Temperature over the years", fill = "Average Temperature (C)") +
theme(axis.text = element_text(size = 6))
#creating different ggplots showcasing average temperature (C), separated by year
While the spatial analysis of the suicide rates alone did not yield interesting results, when you consider the spatial analysis of suicide rates and average temperature, it seems interesting to note that the northern sides (which has a higher suicide rate) seems to have a lower temperature as compared to the southern side (which has a lower suicide rate).
6.1.2 Spatial Analysis: Change in Suicide Rates vs Change in Average Temperature
Change in Suicide Rates
ggplot(counties)+
geom_sf(aes(fill = `CHANGE.IN.SUICIDE.RATE`))+
scale_fill_gradient(high = "Blue", low = "White")+
facet_wrap(~YEAR, ncol = 5)+
labs(title = "Spatial Plot of California counties' Change in Suicide Rates over the years", fill = "Change in Suicide Rate (Per 100,000 People)")+
theme(axis.text = element_text(size = 6))
It is difficult to distinguish changes in suicide rates in different counties as they are look relatively the same (i.e. similar shades of purple).
Change in Average Temperature
ggplot(counties_1)+
geom_sf(aes(fill = `CHANGE.IN.AVG.TEMP`))+
scale_fill_gradient(high = "Red", low = "White")+
facet_wrap(~YEAR, ncol = 5)+
labs(title = "Spatial Plot of California counties' Average Temperature over the years", fill = "Average Temperature (C)")+
theme(axis.text = element_text(size = 6))
The changes in average temperature seems rather erratic and happens rather uniformly across counties.
We conclude that it is difficult to establish potential association between the changes in the two factors.
Recognizing the lack of data set from the Reddit API, we also explored using Guardian API to extract news articles, related to the two keywords “California” and “Heat”, from the Guardian. At the same time, we made sure to narrow down the section to “US News” given our interests in understanding sentiments towards the rising temperature among people in California. Through sentiment analysis, we hope to uncover clues pertaining to how rising temperatures can affect mental health, since emotions like fear, negative and joy are closely related to one’s mental health and wellbeing.
*Note: images of results are attached because there was difficulties in knittinf for the GuardianAPI section.# gu_api_key()
# ca_temperature <- gu_content('"California" AND "heat"', from_date = "2013-01-01", sectionId = "us-news")
# write_csv(ca_temperature, file = "proj data/guardian_data_heat.csv")
# ca_temperature_us <- ca_temperature %>%
# filter(section_name == "US news")
# tidy_cc <- ca_temperature_us %>%
# unnest_tokens(word, body_text)
#
# tidy_cc %<>%
# anti_join(stop_words)
#
# afinn <- tidy_cc %>%
# inner_join(get_sentiments("afinn")) %>%
# mutate(web_publication_date_numeric = as.numeric(web_publication_date)) %>%
# group_by(index = floor(web_publication_date_numeric / 604800)) %>%
# summarise(sentiment = sum(value)) %>%
# mutate(web_publication_date_interval = as.POSIXct(index * 604800, origin = "1970-01-01", tz = "UTC")) %>%
# mutate(method = "AFINN")
#
# ggplot(afinn, aes(index, sentiment, fill = method)) +
# geom_col()
# bing <- tidy_cc %>%
# inner_join(get_sentiments("bing")) %>%
# mutate(method = "Bing et al.") %>%
# mutate(web_publication_date_numeric = as.numeric(web_publication_date)) %>%
# count(method, index = floor(web_publication_date_numeric %/% 604800), sentiment) %>%
# pivot_wider(names_from = sentiment,
# values_from = n,
# values_fill = 0) %>%
# mutate(web_publication_date_interval = as.POSIXct(index * 604800, origin = "1970-01-01", tz = "UTC")) %>%
# mutate(sentiment = positive - negative)
#
# nrc <- tidy_cc %>%
# inner_join(get_sentiments("nrc") %>%
# # only get positive and negative sentiments
# filter(sentiment %in% c("positive", "negative"))) %>%
# mutate(method = "NRC") %>%
# mutate(web_publication_date_numeric = as.numeric(web_publication_date)) %>%
# count(method, index = floor(web_publication_date_numeric %/% 604800), sentiment) %>%
# pivot_wider(names_from = sentiment,
# values_from = n,
# values_fill = 0) %>%
# mutate(web_publication_date_interval = as.POSIXct(index * 604800, origin = "1970-01-01", tz = "UTC")) %>%
# mutate(sentiment = positive - negative)
# all_three <- bind_rows(afinn,
# bing,
# nrc)
#
# ggplot(all_three, aes(index, sentiment, fill = method)) +
# geom_col(aes(fill = method))+
# facet_wrap(~method, ncol = 1, scales = "free_y")
Based on the
sentiment analysis with reference to the AFINN and Bing Lexicon, we
noted the generally negative sentiment towards the heat in California.
However, we were also surprised to see a seemingly parabolic trend in
the graphs, which would be an area worth exploring in the future.
# tidy_body_text <- tidy_cc%>%
# inner_join(get_sentiments("nrc"))
#
# table(tidy_body_text$sentiment)
#
# ggplot(tidy_body_text, aes(y = sentiment))+
# geom_bar(aes(fill = sentiment))+
# theme_minimal()+
# labs(title = "Sentiments in Guardian Posts with Keyword 'Heat' and 'California', US News ")
# tidy_body_text_1 <- tidy_cc%>%
# inner_join(get_sentiments("bing"))
#
# table(tidy_body_text_1$sentiment)
#
# ggplot(tidy_body_text_1, aes(y = sentiment))+
# geom_bar(aes(fill = sentiment))+
# theme_minimal()+
# labs(title = "Sentiments in Guardian Posts with Keyword 'Heat' and 'California', US News ")
# tidy_body_text_2 <- tidy_cc%>%
# inner_join(get_sentiments("afinn"))
#
# table(tidy_body_text_2$sentiment)
#
# ggplot(tidy_body_text_2, aes(y = value))+
# geom_bar(aes(fill = as.factor(value)))+
# theme_minimal()+
# labs(title = "Sentiments in Guardian Posts with Keyword 'Heat' and 'California', US News ")
| Name | Tasks |
|---|---|
| Ernest | - Ideation of topic, writing of Final Report Proposal |
| - Conducted Literature Review and summarized key pointers from research papers | |
| - Data Collection, Graph Plotting, Data Analysis and Coding (Trends, Sentiment Analysis, Annex) | |
| - Presentation slides creation | |
| - Report Writing (Literature Review) | |
| Zhi Yan | - Ideation of topic, writing of Final Report Proposal |
| - Data Analysis, Graph Plotting, Coding (Regression) | |
| - Report Writing (Overall) | |
| Zheng Yang | - Ideation of topic, writing of Final Report Proposal |
| - Data Collection | |
| - Report Writing (Literature Review) |